Generalized linear models (GLM) as the name implies are a generalization of the linear modeling framework to allow for the modeling of response variables (e.g. soil attributes) with non-normal distributions and heterogeneous variances. Whereas linear models are designed for predicting continuous soil properties such as clay content or soil temperature, GLM can be used to predict the presence/absence of argillic horizons (i.e. logistic regression) or counts of a plant species along a transect (i.e. Poisson regression). These generalizations greatly expand the applicability of the linear modeling framework, while still allowing for a similar fitting procedure and interpretation of the resulting models.
In the past in order to handle non-linearity and heterogeneous variances, transformations have been made to the response variable, such as the log(x). However, such transformations complicate the models interpretation because the results refer to the transformed scale (e.g. log(x)). These response transformations are not guaranteed to achieve both normality and constant variance simultaneously. GLM approaches transform the response, but also preserve the scale of the response, and provide separate functions to transform the mean response and variance, known as the link and variance functions respectively. So instead of looking like this:
\(f(y) = \beta_{0} + \beta_{1}x + \varepsilon\)
you get this:
\(g(\mu)\) or \(\eta = \beta_{0} + \beta_{1}x + \varepsilon\)
with \(g(\mu)\) or \(\eta\) symbolizing the link function.
Another alteration of the classical linear model is that with GLM the coefficients are estimated iteratively by maximum likelihood estimation instead of ordinary least squares. This results in the GLM minimizing the deviance, instead of the sum of squares. However, for the Gaussian (i.e. normal) distributions the deviance and sum of squares are equivalent.
Logistic regression is a specific type of GLM designed to model data that has a binomial distribution (i.e. presence/absence, yes/no, or proportional data), which in statistical learning parlance is considered a classification problem. For binomial data the logit link transform is generally used. The effect of the logit transform can be seen in the following figure. It creates a sigmoidal curve, which enhances the separation between the two groups. It also has the effect of ensuring that the values range between 0 and 1.
When comparing a simple linear model vs a simple logistic model we can see the effect of the logit transform on the relationship between the response and predictor variable. As before it follows a sigmoidal curve and prevents predictions from exceeding 0 and 1.
Example 1: Probability of Mollisols (Beaudette & O’Geen, 2009)
Example 2: Probability of Red Clay (Evans & Hartemink, 2014)
Example 3: Probability of Ponding (NRCS Unpublished)
Now that we’ve discussed some of the basic background GLM theory we’ll move on to a real exercise, and address any additional theory where it relates to specific steps in the modeling process. The examples selected for this chapter come from Joshua Tree National Park (JTNP)(i.e. CA794) in the Mojave desert. The problem tackled here is a familiar one: Where can I expect to find argillic horizons on fan piedmonts? Argillic horizons within the Mojave are typically found on fan remnants, which are a stable landform that is a remnant of the Pleistocene (Peterson, 1981). Despite the low relief of most fans, fan remnants are uplands in the sense that they generally don’t receive run-on or active deposition.
With this dataset we’ll encounter some challenges. To start with, fan piedmont landscapes typically have relatively little relief. Since most of our predictors will be derivatives of elevation, that won’t leave us with much to work with. Also, our elevation data comes from the USGS National Elevation dataset (NED), which provides considerably less detail than say LiDAR or IFSAR data (Shi et al., 2012). Lastly our pedon dataset like most in NASIS, hasn’t received near as much quality control as have the components. So we’ll need to wrangle some of the pedon data before we can analyze it. These are all typical problems encountered in any data analysis and should be good practice. Ideally, it would be more interesting to try and model individual soil series with argillic horizons, but due to some of the challenges previously mentioned it would be difficult with this dataset. However, at the end we’ll look at one simple approach to try and separate individual soil series with argillic horizons.
To start, as always we need to load some extra packages. This will become a familiar routine every time you start R. Most of the basic functions we need to develop a logistic regression model are contained in base R, but the following contain some useful spatial and data manipulation functions. Believe it or not we will use all of them and more.
library(aqp) # specialized soil classes and functions
library(soilDB) # NASIS and SDA import functions
library(raster) # guess
library(sf) # vector data functions
library(mapview) # interactive mapping
library(ggplot2) # graphing
library(tidyr) # data manipulation
library(caret) # classification and regression training
library(car) # additional regression tools
Hopefully like all good soil scientists and ecological site specialists you enter your field data into NASIS. Better yet hopefully someone else did it for you! Once data are captured in NASIS it is much easier to import the data into R, extract the pieces you need, manipulate it, model it, etc. If it’s not entered into NASIS, it may as well not exist. For this exercise we’ll load a cached dataset on GitHub.
# pedons <- fetchNASIS()
githubURL <- "https://raw.githubusercontent.com/ncss-tech/stats_for_soil_survey/master/data/ch7_data.Rdata"
load(url(githubURL))
# Examine the makeup of the data we imported from NASIS
str(pedons, max.level = 2)
## Formal class 'SoilProfileCollection' [package "aqp"] with 11 slots
## ..@ idcol : chr "peiid"
## ..@ hzidcol : chr "phiid"
## ..@ hzdesgncol : chr "hzname"
## ..@ hztexclcol : chr "texture"
## ..@ depthcols : chr [1:2] "hzdept" "hzdepb"
## ..@ metadata :'data.frame': 1 obs. of 2 variables:
## ..@ horizons :'data.frame': 5163 obs. of 54 variables:
## ..@ site :'data.frame': 1220 obs. of 122 variables:
## ..@ sp :Formal class 'SpatialPoints' [package "sp"] with 3 slots
## ..@ diagnostic :'data.frame': 2161 obs. of 4 variables:
## ..@ restrictions:'data.frame': 0 obs. of 0 variables
Generally before we begin modeling you should spend some time exploring the data. By examining a simple summary we can quickly see the breakdown of how many argillic horizons we have. Unfortunately, odds are good that all the argillic horizons haven’t been consistently populated in the diagnostic horizon table like they should be. Luckily for us, the desert argillic horizons always pop up in the taxonomic name, so we can use pattern matching to extract it. By doing this we gain an additional 11 pedons with argillic horizons and are able to label the missing values (i.e. NA). At a minimum for modeling purposes we probably need 10 pedons of the target we’re interested in and a total of 100 observations overall.
# Check consistency of argillic horizon population
# get the site table
s <- site(pedons)
# tabulate the number of argillic horizons observed
table(s$argillic.horizon, useNA = "ifany")
##
## FALSE TRUE <NA>
## 790 263 167
# or
# summary(s$argillic.horizon)
# Extract argillic presence from the taxonomic subgroup
s$argillic <- grepl("arg", s$taxsubgrp)
table(s$argillic, useNA = "ifany")
##
## FALSE TRUE
## 1022 198
Ideally, if the diagnostic horizon table had been populated consistently we could have used the upper depth to diagnostic feature to filter out argillic horizons that start below 50cm, which may not be representative of “good” argillic horizons and may therefore have gotten correlated to a Torripsamments anyway. Not only are unrepresentative sites confusing for scientists, they’re equally confusing for models. However, as we saw earlier, some pedons don’t appear to be fully populated, so we’ll stick with those pedons that have the argillic specified in their taxonomic subgroup name, since it gives us the biggest sample.
d <- diagnostic_hz(pedons)
d_sub <- subset(d, featkind == "argillic horizon" & featdept < 50)
s$argillic.horizon50 <- ifelse(s$peiid %in% d_sub$peiid, TRUE, FALSE)
table(s$argillic.horizon50, useNA = "ifany")
##
## FALSE TRUE
## 998 222
Another obvious place to look is at the geomorphic data in the site table. This information is intended to help differentiate where our soil observations exist on the landscape. If populated consistently it could potentially be used in future disaggregation efforts, as demonstrated by Nauman and Thompson (2014).
# Landform vs argillic presence
# Subset
s_sub <- subset(s, argillic == TRUE)
# Cross tabulate landform vs argillic horizon presence
test <- with(s_sub,
table(landform, argillic, useNA = "ifany")
)
# Subset and print landform.string with > 3 observations
test[test > 3, ]
## alluvial fans fan aprons
## 6 19
## fan aprons on fan remnants fan remnants
## 4 72
## hills hillslopes
## 15 28
## low hills mountain slopes
## 5 8
## mountains pediments
## 4 9
## <NA>
## 7
# generalize the landform.string
s$landform_generic <- ifelse(grepl("fan|terrace|sheet|drainageway|wash", s$landform), "fan", "hill")
Examining the above frequency table we can see that argillic horizons occur predominantly on fan remnants as was alluded too earlier. However, they also seem to occur frequently on other landforms - some of which are curious combinations of landforms or redundant terms.
# Hillslope position
# Subset fan landforms
s_sub <- subset(s, landform_generic == "fan")
# Cross tabulate and calculate proportions, the "2" calculates the proportions relative to the column totals
with(s_sub, round(
prop.table(table(hillslopeprof, argillic, useNA = "ifany"), 2)
* 100)
)
## argillic
## hillslopeprof FALSE TRUE
## summit 16 39
## shoulder 4 2
## backslope 14 20
## footslope 2 1
## toeslope 16 4
## <NA> 48 34
# Slope shape
with(s_sub, round(
prop.table(table(paste(shapedown, shapeacross), argillic, useNA = "ifany"), 2)
* 100)
)
## argillic
## FALSE TRUE
## concave concave 1 0
## concave convex 0 0
## concave linear 4 3
## convex concave 0 1
## convex convex 7 7
## convex linear 6 9
## linear concave 6 1
## linear convex 21 32
## linear linear 44 38
## linear NA 0 0
## NA NA 11 10
Looking at the hillslope position of fan landforms we can see a slightly higher proportion of argillic horizons are found on summits, while less are found on toeslopes. Slope shape doesn’t seem to provide any useful information for distinguishing argillic horizons.
# Subset to just look and fans, and select numeric columns
s_sub <- subset(s, landform_generic == "fan", select = c(argillic, bedrckdepth, slope, elev, surface_total_frags_pct))
# convert s_sub to wide data format
s_w <- reshape2::melt(s_sub, id.vars = "argillic", measure.vars = c("bedrckdepth", "slope", "elev", "surface_total_frags_pct"))
head(s_w, 2)
## argillic variable value
## 1 FALSE bedrckdepth NA
## 2 FALSE bedrckdepth 11
library(ggplot2)
ggplot(s_w, aes(x = argillic, y = value)) +
geom_boxplot() +
facet_wrap(~ variable, scale = "free")
Looking at boxplots of our numeric variables we can see none of them show much separation between the presense/absense of argillic horizons.
Next we’ll look at soil scientist bias. The question being: Are some soil scientists more likely to describe argillic horizons than others? Due to the excess number of soil scientist that have worked on CA794, including detailees, we’ve filtered the names of soil scientist to include just the top 3 mappers and given priority to the most senior soil scientists when they occur together.
# Custom function to filter out the top 3 soil scientists
s <- within(s, {
old = descname
descname2 = NA
descname2[grepl("Stephen", old)] = "Stephen" # least senior
descname2[grepl("Paul", old)] = "Paul"
descname2[grepl("Peter", old)] = "Peter" # most senior
})
s_sub <- subset(s, landform_generic == "fan")
# By frequency
with(s_sub, table(descname2, argillic, useNA = "ifany"))
## argillic
## descname2 FALSE TRUE
## Paul 77 28
## Peter 268 29
## Stephen 66 13
## <NA> 157 45
# By proportion
with(s_sub, round(
prop.table(table(descname2, argillic), margin = 1)
* 100)
)
## argillic
## descname2 FALSE TRUE
## Paul 73 27
## Peter 90 10
## Stephen 84 16
For fan landforms, one of the soil scientists seems more likely than the others to describe argillic horizons. However while this information is suggestive, it is far from definitive in showing a potential bias because it doesn’t take into account other factors. We’ll examine this more closely later.
Where do our points plot? To start we need to convert them to a spatial object first. Then we can create an interactive we map using mapview. Also, if we wish we can also export the locations as a Shapefile.
library(sf)
library(dplyr)
# create an index to remove sites without coordinates
idx <- complete.cases(s$x_sdt, s$y_std)
s_sub <- s[idx, ]
# convert to sites to a spatial object
s_sf <- st_as_sf(s_sub,
coords = c("x_std", "y_std"),
crs = 4326
) %>%
# reproject
st_transform(crs = 5070)
# Read in soil survey area boundaries
ca794 <- read_sf(dsn = "D:/geodata/soils/soilsa_a_nrcs.shp", layer = "soilsa_a_nrcs") %>%
# subset out Joshua Tree National Park
filter(areasymbol == "CA794") %>%
# reproject
st_transform(crs = 5070)
# Plot
library(mapview)
mapview(ca794, fill = NA) +
mapview(s_sf, zcol = "argillic")